Goto

Collaborating Authors

 code review


SecureReviewer: Enhancing Large Language Models for Secure Code Review through Secure-aware Fine-tuning

Liu, Fang, Liu, Simiao, Zhu, Yinghao, Lian, Xiaoli, Zhang, Li

arXiv.org Artificial Intelligence

Identifying and addressing security issues during the early phase of the development lifecycle is critical for mitigating the long-term negative impacts on software systems. Code review serves as an effective practice that enables developers to check their teammates' code before integration into the codebase. To streamline the generation of review comments, various automated code review approaches have been proposed, where LLM-based methods have significantly advanced the capabilities of automated review generation. However, existing models primarily focus on general-purpose code review, their effectiveness in identifying and addressing security-related issues remains underexplored. Moreover, adapting existing code review approaches to target security issues faces substantial challenges, including data scarcity and inadequate evaluation metrics. To address these limitations, we propose SecureReviewer, a new approach designed for enhancing LLMs' ability to identify and resolve security-related issues during code review. Specifically, we first construct a dataset tailored for training and evaluating secure code review capabilities. Leveraging this dataset, we fine-tune LLMs to generate code review comments that can effectively identify security issues and provide fix suggestions with our proposed secure-aware fine-tuning strategy. To mitigate hallucination in LLMs and enhance the reliability of their outputs, we integrate the RAG technique, which grounds the generated comments in domain-specific security knowledge. Additionally, we introduce SecureBLEU, a new evaluation metric designed to assess the effectiveness of review comments in addressing security issues. Experimental results demonstrate that SecureReviewer outperforms state-of-the-art baselines in both security issue detection accuracy and the overall quality and practical utility of generated review comments.


Fine-Tuning Multilingual Language Models for Code Review: An Empirical Study on Industrial C# Projects

Begolli, Igli, Aksoy, Meltem, Neider, Daniel

arXiv.org Artificial Intelligence

Code review is essential for maintaining software quality but often time-consuming and cognitively demanding, especially in industrial environments. Recent advancements in language models (LMs) have opened new avenues for automating core review tasks. This study presents the empirical evaluation of monolingual fine-tuning on the performance of open-source LMs across three key automated code review tasks: Code Change Quality Estimation, Review Comment Generation, and Code Refinement. We fine-tuned three distinct models, CodeReviewer, CodeLlama-7B, and DeepSeek-R1-Distill, on a C\# specific dataset combining public benchmarks with industrial repositories. Our study investigates how different configurations of programming languages and natural languages in the training data affect LM performance, particularly in comment generation. Additionally, we benchmark the fine-tuned models against an automated software analysis tool (ASAT) and human reviewers to evaluate their practical utility in real-world settings. Our results show that monolingual fine-tuning improves model accuracy and relevance compared to multilingual baselines. While LMs can effectively support code review workflows, especially for routine or repetitive tasks, human reviewers remain superior in handling semantically complex or context-sensitive changes. Our findings highlight the importance of language alignment and task-specific adaptation in optimizing LMs for automated code review.


Grounded AI for Code Review: Resource-Efficient Large-Model Serving in Enterprise Pipelines

Mandal, Sayan, Jiang, Hua

arXiv.org Artificial Intelligence

Automated code review adoption lags in compliance-heavy settings, where static analyzers produce high-volume, low-rationale outputs, and naive LLM use risks hallucination and incurring cost overhead. We present a production system for grounded, PR-native review that pairs static-analysis findings with AST-guided context extraction and a single-GPU, on-demand serving stack (quantized open-weight model, multi-tier caching) to deliver concise explanations and remediation guidance. Evaluated on safety-oriented C/C++ standards, the approach achieves sub-minute median first-feedback (offline p50 build+LLM 59.8s) while maintaining competitive violation reduction and lower violation rates versus larger proprietary models. The architecture is decoupled: teams can adopt the grounding/prompting layer or the serving layer independently. A small internal survey (n=8) provides directional signals of reduced triage effort and moderate perceived grounding, with participants reporting fewer human review iterations. We outline operational lessons and limitations, emphasizing reproducibility, auditability, and pathways to broader standards and assisted patching.


Fine-Tuning LLMs to Analyze Multiple Dimensions of Code Review: A Maximum Entropy Regulated Long Chain-of-Thought Approach

Yu, Yongda, Shi, Guohao, Wu, Xianwei, He, Haochuan, Gu, XueMing, Zhao, Qianqian, Liu, Kui, Wang, Qiushi, Tian, Zhao, Shen, Haifeng, Rong, Guoping

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have shown great potential in supporting automated code review due to their impressive capabilities in context understanding and reasoning. However, these capabilities are still limited compared to human-level cognition because they are heavily influenced by the training data. Recent research has demonstrated significantly improved performance through fine-tuning LLMs with code review data. However, compared to human reviewers who often simultaneously analyze multiple dimensions of code review to better identify issues, the full potential of these methods is hampered by the limited or vague information used to fine-tune the models. This paper contributes MelcotCR, a chain-of-thought (COT) fine-tuning approach that trains LLMs with an impressive reasoning ability to analyze multiple dimensions of code review by harnessing long COT techniques to provide rich structured information. To address context loss and reasoning logic loss issues that frequently occur when LLMs process long COT prompts, we propose a solution that combines the Maximum Entropy (ME) modeling principle with pre-defined reasoning pathways in MelcotCR to enable more effective utilization of in-context knowledge within long COT prompts while strengthening the logical tightness of the reasoning process. Empirical evaluations on our curated MelcotCR dataset and the public CodeReviewer dataset reveal that a low-parameter base model, such as 14B Qwen2.5, fine-tuned with MelcotCR can surpass state-of-the-art methods in terms of the accuracy of detecting and describing code issues, with its performance remarkably on par with that of the 671B DeepSeek-R1 model.


Intuition to Evidence: Measuring AI's True Impact on Developer Productivity

Kumar, Anand, Khare, Vishal, Sharma, Deepak, Kumar, Satyam, Saini, Vijay, Yadav, Anshul, Jain, Sachendra, Rana, Ankit, Verma, Pratham, Meena, Vaibhav, Edubilli, Avinash

arXiv.org Artificial Intelligence

We present a comprehensive real-world evaluation of AI-assisted software development tools deployed at enterprise scale. Over one year, 300 engineers across multiple teams integrated an in-house AI platform (DeputyDev) that combines code generation and automated review capabilities into their daily workflows. Through rigorous cohort analysis, our study demonstrates statistically significant productivity improvements, including an overall 31.8% reduction in PR review cycle time. Developer adoption was strong, with 85% satisfaction for code review features and 93% expressing a desire to continue using the platform. Adoption patterns showed systematic scaling from 4% engagement in month 1 to 83% peak usage by month 6, stabilizing at 60% active engagement. Top adopters achieved a 61% increase in code volume pushed to production, contributing to approximately 30 to 40% of code shipped to production through this tool, accounting for an overall 28% increase in code shipment volume. Unlike controlled benchmark evaluations, our longitudinal analysis provides empirical evidence from production environments, revealing both the transformative potential and practical deployment challenges of integrating AI into enterprise software development workflows.


Previously on... Automating Code Review

Heumüller, Robert, Ortmeier, Frank

arXiv.org Artificial Intelligence

Modern Code Review (MCR) is a standard practice in software engineering, yet it demands substantial time and resource investments. Recent research has increasingly explored automating core review tasks using machine learning (ML) and deep learning (DL). As a result, there is substantial variability in task definitions, datasets, and evaluation procedures. This study provides the first comprehensive analysis of MCR automation research, aiming to characterize the field's evolution, formalize learning tasks, highlight methodological challenges, and offer actionable recommendations to guide future research. Focusing on the primary code review tasks, we systematically surveyed 691 publications and identified 24 relevant studies published between May 2015 and April 2024. Each study was analyzed in terms of tasks, models, metrics, baselines, results, validity concerns, and artifact availability. In particular, our analysis reveals significant potential for standardization, including 48 task metric combinations, 22 of which were unique to their original paper, and limited dataset reuse. We highlight challenges and derive concrete recommendations for examples such as the temporal bias threat, which are rarely addressed so far. Our work contributes to a clearer overview of the field, supports the framing of new research, helps to avoid pitfalls, and promotes greater standardization in evaluation practices.


DeputyDev -- AI Powered Developer Assistant: Breaking the Code Review Logjam through Contextual AI to Boost Developer Productivity

Khare, Vishal, Saini, Vijay, Sharma, Deepak, Kumar, Anand, Rana, Ankit, Yadav, Anshul

arXiv.org Artificial Intelligence

This study investigates the implementation and efficacy of DeputyDev, an AI-powered code review assistant developed to address inefficiencies in the software development process. The process of code review is highly inefficient for several reasons, such as it being a time-consuming process, inconsistent feedback, and review quality not being at par most of the time. Using our telemetry data, we observed that at TATA 1mg, pull request (PR) processing exhibits significant inefficiencies, with average pick-up and review times of 73 and 82 hours, respectively, resulting in a 6.2 day closure cycle. The review cycle was marked by prolonged iterative communication between the reviewing and submitting parties. Research from the University of California, Irvine indicates that interruptions can lead to an average of 23 minutes of lost focus, critically affecting code quality and timely delivery. To address these challenges, we developed DeputyDev's PR review capabilities by providing automated, contextual code reviews. We conducted a rigorous double-controlled A/B experiment involving over 200 engineers to evaluate DeputyDev's impact on review times. The results demonstrated a statistically significant reduction in both average per PR (23.09%) and average per-line-of-code (40.13%) review durations. After implementing safeguards to exclude outliers, DeputyDev has been effectively rolled out across the entire organisation. Additionally, it has been made available to external companies as a Software-as-a-Service (SaaS) solution, currently supporting the daily work of numerous engineering professionals. This study explores the implementation and effectiveness of AI-assisted code reviews in improving development workflow timelines and code.


Exploring the Potential of Large Language Models in Fine-Grained Review Comment Classification

Nguyen, Linh, Liu, Chunhua, Lin, Hong Yi, Thongtanunam, Patanamon

arXiv.org Artificial Intelligence

--Code review is a crucial practice in software development. As code review nowadays is lightweight, various issues can be identified, and sometimes, they can be trivial. Research has investigated automated approaches to classify review comments to gauge the effectiveness of code reviews. However, previous studies have primarily relied on supervised machine learning, which requires extensive manual annotation to train the models effectively. T o address this limitation, we explore the potential of using Large Language Models (LLMs) to classify code review comments. We assess the performance of LLMs to classify 17 categories of code review comments. Our results show that LLMs can classify code review comments, outperforming the state-of-the-art approach using a trained deep learning model. In particular, LLMs achieve better accuracy in classifying the five most useful categories, which the state-of-the-art approach struggles with due to low training examples. Rather than relying solely on a specific small training data distribution, our results show that LLMs provide balanced performance across high-and low-frequency categories. These results suggest that the LLMs could offer a scalable solution for code review analytics to improve the effectiveness of the code review process. Index T erms --code review, review comment classification, prompt engineering, large language models. Code Review (CR) is a practice in software development where developers review other developer's code changes asynchronously to find defects and suggest improvements [1]. Acting as a quality assurance gateway, CR has become mandatory in many prominent organizations, with developers reportedly spending 10-15% of their time on this task [2]. In practice, various types of concerns can be raised in CR comments, ranging from code styling to functional issues. As comments often trigger the improvements of code changes, the types of comments play a crucial role in the quality of CR. Constructive and actionable comments addressing quality-improving issues would positively contribute to CR's overall quality and code changes [3]-[5]. On the other hand, trivial or irrelevant comments can waste developers' time without improving the code changes [6].


Hallucinations in Code Change to Natural Language Generation: Prevalence and Evaluation of Detection Metrics

Liu, Chunhua, Lin, Hong Yi, Thongtanunam, Patanamon

arXiv.org Artificial Intelligence

Language models have shown strong capabilities across a wide range of tasks in software engineering, such as code generation, yet they suffer from hallucinations. While hallucinations have been studied independently in natural language and code generation, their occurrence in tasks involving code changes which have a structurally complex and context-dependent format of code remains largely unexplored. This paper presents the first comprehensive analysis of hallucinations in two critical tasks involving code change to natural language generation: commit message generation and code review comment generation. We quantify the prevalence of hallucinations in recent language models and explore a range of metric-based approaches to automatically detect them. Our findings reveal that approximately 50\% of generated code reviews and 20\% of generated commit messages contain hallucinations. Whilst commonly used metrics are weak detectors on their own, combining multiple metrics substantially improves performance. Notably, model confidence and feature attribution metrics effectively contribute to hallucination detection, showing promise for inference-time detection.\footnote{All code and data will be released upon acceptance.


An ML-based Approach to Predicting Software Change Dependencies: Insights from an Empirical Study on OpenStack

Arabat, Ali, Sayagh, Mohammed, Hassine, Jameleddine

arXiv.org Artificial Intelligence

As software systems grow in complexity, accurately identifying and managing dependencies among changes becomes increasingly critical. For instance, a change that leverages a function must depend on the change that introduces it. Establishing such dependencies allows CI/CD pipelines to build and orchestrate changes effectively, preventing build failures and incomplete feature deployments. In modern software systems, dependencies often span multiple components across teams, creating challenges for development and deployment. They serve various purposes, from enabling new features to managing configurations, and can even involve traditionally independent changes like documentation updates. To address these challenges, we conducted a preliminary study on dependency management in OpenStack, a large-scale software system. Our study revealed that a substantial portion of software changes in OpenStack over the past 10 years are interdependent. Surprisingly, 51.08% of these dependencies are identified during the code review phase-after a median delay of 5.06 hours-rather than at the time of change creation. Developers often spend a median of 57.12 hours identifying dependencies, searching among a median of 463 other changes. To help developers proactively identify dependencies, we propose a semi-automated approach that leverages two ML models. The first model predicts the likelihood of dependencies among changes, while the second identifies the exact pairs of dependent changes. Our proposed models demonstrate strong performance, achieving average AUC scores of 79.33% and 91.89%, and Brier scores of 0.11 and 0.014, respectively. Indeed, the second model has a good top-k recall across all types of pairs, while the top-k precision has room for improvement.